402



LMA invoked as corpus 
is loaded into memory
and split into subsets 



Data structure generator 
assigns each item of subset to 
a node in data structure and 
calculates a frequency value for 
the item 
Yes



410 



Minimize data structure 
size utilizing pruning 
technique 
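
The data structure steps above can be sketched as follows. This is a minimal illustration, not the patented method: it assumes the data structure is a word-level prefix tree, that the frequency value is a raw occurrence count, and that pruning means dropping nodes below a count cutoff. The names `Node`, `build_prefix_tree`, `prune`, `size`, and the parameters `order` and `min_count` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the prefix tree: an item plus its frequency value."""
    count: int = 0
    children: dict = field(default_factory=dict)


def build_prefix_tree(tokens, order=3):
    """Insert every n-gram of length <= order, counting frequencies."""
    root = Node()
    for start in range(len(tokens)):
        node = root
        for token in tokens[start:start + order]:
            node = node.children.setdefault(token, Node())
            node.count += 1
    return root


def prune(node, min_count=2):
    """Drop every subtree whose count is below min_count (assumed cutoff)."""
    node.children = {
        token: child for token, child in node.children.items()
        if child.count >= min_count
    }
    for child in node.children.values():
        prune(child, min_count)


def size(node):
    """Number of nodes below (and excluding) the given node."""
    return sum(1 + size(child) for child in node.children.values())


corpus = "the cat sat on the mat the cat ran".split()
tree = build_prefix_tree(corpus)
before = size(tree)
prune(tree, min_count=2)
after = size(tree)
```

In this toy corpus only "the", "cat", and the bigram "the cat" occur at least twice, so those are the nodes that survive pruning.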
500



Build a prefix tree from the received corpus 



Build a seed language model (LM) from a tuning set of data 



Automatically segment a training set from the received corpus into 
N chunks satisfying a size range constraint, delivering the most 
similarity within the chunk and the most disparity between chunks 



Rank the chunks of the training set in order of increasing perplexity 

between the chunks 



Combine the training data to train a language model (LM) by 
combining the counts of different chunk sets weighted by 
similarities, or by building a distinct LM for distinct chunk sets and 
using an optimized interpolation weight to combine the models. 



Perform language model compression based, at least in part, on 
memory constraints and/or application requirements 
Measure the statistical similarity within training blocks on both sides

of a gap between each training unit (Cohesion score) 



71 

Measure the statistical disparity between training blocks on both
sides of a gap between each training unit (Depth score) 



Select as training chunk boundaries all gaps wherein the Depth score
reaches a predefined threshold



Prune the training chunk selections based, at least in part, on a 
size range constraint 
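
The segmentation steps above can be sketched as follows. The figure does not define the scores, so this sketch makes assumptions: the Cohesion score is taken to be the cosine similarity of word counts in the blocks on either side of a gap, the Depth score is computed in the TextTiling style (how far a cohesion valley sits below the nearest peaks on each side), and the size range constraint is reduced to a lower bound only. All function names and the parameters `block`, `threshold`, and `min_units` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cohesion_scores(units, block=2):
    """Similarity between the blocks of `block` units on both sides of each gap."""
    scores = []
    for gap in range(1, len(units)):
        left = Counter(t for u in units[max(0, gap - block):gap] for t in u)
        right = Counter(t for u in units[gap:gap + block] for t in u)
        scores.append(cosine(left, right))
    return scores


def depth_scores(cohesion):
    """Depth of each cohesion valley relative to the nearest peak on each side."""
    depths = []
    for i, score in enumerate(cohesion):
        left = i
        while left > 0 and cohesion[left - 1] >= cohesion[left]:
            left -= 1
        right = i
        while right < len(cohesion) - 1 and cohesion[right + 1] >= cohesion[right]:
            right += 1
        depths.append((cohesion[left] - score) + (cohesion[right] - score))
    return depths


def segment(units, threshold=0.5, min_units=2, block=2):
    """Cut at gaps whose depth reaches the threshold, skipping undersized chunks."""
    depths = depth_scores(cohesion_scores(units, block))
    boundaries, last = [], 0
    for gap, depth in enumerate(depths, start=1):
        # Gap g lies between units[g - 1] and units[g].
        big_enough = gap - last >= min_units and len(units) - gap >= min_units
        if depth >= threshold and big_enough:
            boundaries.append(gap)
            last = gap
    edges = [0] + boundaries + [len(units)]
    return [units[a:b] for a, b in zip(edges, edges[1:])]


units = [["cat", "purr"], ["cat", "meow"], ["cat", "purr"],
         ["stock", "price"], ["stock", "trade"], ["stock", "price"]]
chunks = segment(units)
```

Here the only gap with no shared vocabulary on either side is the one between the third and fourth units, so the toy input is split into two chunks of three units each.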
Calculate a perplexity score for each training chunk representing
the predictive power of the seed LM to the training chunk 
Augment tuning set with the
top N training chunks 
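
The ranking steps above can be sketched as follows. As a stand-in for the seed LM, this example assumes an add-one smoothed unigram model trained on the tuning set; the figure does not say which model family is used. The helper names, the toy data, and `top_n` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Seed LM stand-in: raw unigram counts and their total."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def perplexity(chunk, counts, total, vocab_size):
    """Perplexity of a chunk under the add-one smoothed unigram seed LM."""
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in chunk
    )
    return math.exp(-log_prob / len(chunk))


tuning = "book a flight to boston".split()
chunks = [
    "stock prices fell sharply".split(),
    "book a flight to denver".split(),
    "a flight to boston".split(),
]
vocab_size = len(set(tuning).union(*chunks))
counts, total = train_unigram(tuning)

# Rank chunks in order of increasing perplexity (best predicted first).
ranked = sorted(chunks, key=lambda c: perplexity(c, counts, total, vocab_size))

# Augment the tuning set with the top N chunks.
top_n = 1
augmented_tuning = tuning + [t for c in ranked[:top_n] for t in c]
```

A lower perplexity means the seed LM predicts the chunk better, so the chunk closest to the tuning data is ranked first and folded back into the tuning set.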
Filter the sorted training data based, at least in part, on memory
and/or application constraints 
Combine the remaining training data by either combining counts or

by combining models 



Determine the Count for each 
of the training chunks
Cluster the training chunks
into a few clusters 



Combine the counts from all 
training chunks weighted by 
their rate
Train a backoff LM for each
cluster 
Interpolate the results using
estimate maximize weighting 
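
The interpolation step above can be sketched as follows, reading "estimate maximize weighting" as an expectation-maximization (EM) estimate of linear interpolation weights on held-out data. That reading is an assumption, as are the function names and the toy probabilities; each row of `heldout` holds the probability that each cluster LM assigns to one held-out token.

```python
def em_interpolation_weights(component_probs, iterations=50):
    """Estimate mixture weights; component_probs[i][k] is P_k(token i)."""
    k = len(component_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: accumulate each model's posterior responsibility per token.
        totals = [0.0] * k
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for j in range(k):
                totals[j] += weights[j] * probs[j] / mix
        # M-step: new weights are the average responsibilities.
        weights = [t / len(component_probs) for t in totals]
    return weights


def interpolate(weights, probs):
    """Linearly interpolated probability of one token."""
    return sum(w * p for w, p in zip(weights, probs))


# Hypothetical held-out probabilities from two cluster LMs.
heldout = [(0.5, 0.1), (0.4, 0.1), (0.3, 0.2), (0.1, 0.1)]
weights = em_interpolation_weights(heldout)
combined = interpolate(weights, heldout[0])
```

Because the first model assigns at least as much probability as the second to every held-out token, EM shifts weight toward it while keeping the weights summing to one.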



